Self Training Wrapper Induction with Linked Data

Authors

  • Anna Lisa Gentile
  • Ziqi Zhang
  • Fabio Ciravegna
Abstract

This work explores the use of Linked Data for Web-scale Information Extraction, with a focus on the task of Wrapper Induction. We show how to use Linked Data effectively to generate training material automatically and build a self-trained Wrapper Induction method. Experiments on a publicly available dataset demonstrate that, for covered domains, our method can achieve an F-measure of 0.85, which is competitive with a supervised solution.
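
The general idea of using known values as automatic training material can be illustrated with a minimal sketch. It is not the authors' method: it assumes a small set of known attribute values (standing in for values that might be gathered from Linked Data), well-formed XHTML pages, and a simple majority vote over tag paths. The names `GAZETTEER`, `PAGES`, `text_nodes`, `induce_wrapper`, and `apply_wrapper` are hypothetical and exist only for this illustration.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical seed values, standing in for entries gathered from Linked Data.
GAZETTEER = {"dune", "emma"}

# Hypothetical pages from one site that share a template (well-formed XHTML).
PAGES = [
    "<html><body><h1>Dune</h1><p>Frank Herbert</p></body></html>",
    "<html><body><h1>Emma</h1><p>Jane Austen</p></body></html>",
    "<html><body><h1>Solaris</h1><p>Stanislaw Lem</p></body></html>",
]


def text_nodes(element, path=""):
    """Yield (tag_path, text) for every element with non-empty leading text.

    Tail text after child elements is ignored to keep the sketch short.
    """
    here = f"{path}/{element.tag}"
    if element.text and element.text.strip():
        yield here, element.text.strip()
    for child in element:
        yield from text_nodes(child, here)


def induce_wrapper(pages, gazetteer):
    """Return the tag path whose text most often matches a gazetteer entry."""
    votes = Counter()
    for page in pages:
        for path, text in text_nodes(ET.fromstring(page)):
            if text.lower() in gazetteer:
                votes[path] += 1
    return votes.most_common(1)[0][0] if votes else None


def apply_wrapper(page, path):
    """Extract the text of every element found at the learned tag path."""
    return [text for p, text in text_nodes(ET.fromstring(page)) if p == path]


wrapper = induce_wrapper(PAGES, GAZETTEER)
extracted = apply_wrapper(PAGES[2], wrapper)
```

In this toy setting, the path learned from the two pages whose titles appear in the seed set also extracts the title from the third page, whose value is not in the seed set. A realistic system would additionally need to handle malformed HTML, noisy matches, and ambiguous values.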

Similar resources

View Validation: A Case Study for Wrapper Induction and Text Classification

Wrapper induction algorithms, which use labeled examples to learn extraction rules, are a crucial component of information agents that integrate semi-structured information sources. Multi-view wrapper induction algorithms reduce the amount of training data by exploiting several types of rules (i.e., views), each of which is sufficient to extract the relevant data. All multi-view algorithms re...

Early Steps Towards Web Scale Information Extraction with LODIE

Extracting information from a gigantic data source such as the web has been considered a major research challenge, and over the years many different approaches (Etzioni et al. 2004; Banko et al. 2007; Carlson et al. 2010; Freedman and Ramshaw 2011; Nakashole, Theobald, and Weikum 2011) have been proposed. Nevertheless, the current state of the art has mainly addressed tasks for w...

Automatic Wrappers for Large Scale Web Extraction

We present a generic framework to make wrapper induction algorithms tolerant to noise in the training data. This enables us to learn wrappers in a completely unsupervised manner from automatically and cheaply obtained noisy training data, e.g., using dictionaries and regular expressions. By removing the site-level supervision that wrapper-based techniques require, we are able to perform informa...

XPath-Wrapper Induction by generating tree traversal patterns

We introduce a wrapper induction algorithm for extracting information from tree-structured documents like HTML or XML. It derives XPath-compatible extraction rules from a set of annotated example documents. The approach builds a minimally generalized tree traversal pattern, and augments it with conditions. Another variant selects a subset of conditions so that (a) the pattern is consistent with...

Boosted Wrapper Induction

Recent work in machine learning for information extraction has focused on two distinct sub-problems: the conventional problem of filling template slots from natural language text, and the problem of wrapper induction, learning simple extraction procedures (“wrappers”) for highly structured text such as Web pages produced by CGI scripts. For suitably regular domains, existing wrapper induction a...

Publication date: 2014